(54) Memory and instructions in computer architecture containing processor and coprocessor



(57) A computer system is provided which comprises a first processor 1, a second processor 2 for use as
a coprocessor to the first processor 1 and a memory 3. There are also provided data buffers 5 for buffering
data to be written to, or read from, the memory 3 in data bursts in accordance with burst instructions.
These burst instructions are executed by a burst controller 7, and are provided in sequence for execution
by a burst instruction queue 6. Burst instructions are provided by the first processor 1 to the burst
instruction queue 6, and data is read from, and written to, the memory 3 by the second processor 2 through
the data buffers 5 in accordance with burst instructions executed by the burst controller 7. Coprocessor
instructions are provided to control execution of the coprocessor 2, and synchronisation between
coprocessor instructions and burst instructions is achieved by a synchronisation mechanism 10, 11 and use
of specific coprocessor and burst instructions.
Description

FIELD OF INVENTION 

[0001] The invention relates to computer architectures involving a main processor and a coprocessor, and in
particular to use of memory resources by the coprocessor in such architectures.

DESCRIPTION OF PRIOR ART 

[0002] Microprocessor-based computer systems are typically based around a general purpose microprocessor as
CPU. Such microprocessors are well adapted to handle a wide range of computational tasks, but they are
inevitably not optimised for all tasks. Where tasks are computationally intense (such as media processing)
then the CPU will frequently not be able to perform acceptably.

[0003] One of the standard approaches to this problem is to use coprocessors specifically adapted to handle
individual computationally difficult tasks. Such coprocessors can be built using ASICs (Application Specific
Integrated Circuits). These are built for specific computational tasks, and can thus be optimised for such
tasks. They are however inflexible in use (as they are designed for a specific task alone) and are typically
slow to produce. Improved solutions can be found by construction of flexible hardware which can be programmed
with a configuration particularly suited to a given computational task, such as FPGAs (Field Programmable
Gate Arrays). Further flexibility is achieved if such structures are not only configurable, but
reconfigurable. An example of such a reconfigurable structure is the CHESS array discussed in International
Patent Application No. GB98/00262, International Patent Application No. GB98/00248, US Patent Application
No. 09/209,542, filed on 11 December 1998, and its European equivalent European Patent Application
No. 98309600.9.

[0004] Although use of such coprocessors can considerably improve the efficiency of such computation,
conventional architectural arrangements can inhibit the effectiveness of coprocessors. It is desirable to
achieve an arrangement in which computations can be still more effectively devolved to coprocessors,
particularly where these computations involve processing of large quantities of data.

Summary of Invention 


[0005] Accordingly, there is provided a computer system, comprising: a first processor; a second processor
for use as a coprocessor to the first processor; a memory; at least one data buffer for buffering data to be
written to or read from the memory in data bursts in accordance with burst instructions; a burst controller
for executing the burst instructions; and a burst instructions element for providing burst instructions in a
sequence for execution by the burst controller; whereby burst instructions are provided by the first
processor to the burst instructions element, and data is read from and written to the memory by the second
processor through the at least one data buffer in accordance with burst instructions executed by the burst
controller.

[0006] This arrangement is particularly advantageous where the coprocessor is to work on large blocks of
data, particularly where the memory addresses of such blocks vary regularly. This arrangement allows for such
blocks to be moved effectively in and out of the main memory with minimal involvement of the main processor
(which is the system component least well adapted to use them).

[0007] A particularly efficient structure can be achieved if the coprocessor is controlled in a similar way
to the data buffers. This can be done with a coprocessor instructions element for providing coprocessor
instructions to control execution of the second processor in a sequence (with said coprocessor instructions
originally provided by the first processor). Advantageously a coprocessor controller receives the
coprocessor instructions from the coprocessor instructions element and controls execution of the second
processor accordingly. This coprocessor controller may control communication between the coprocessor and the
at least one data buffer: for example, where a bus exists between the coprocessor controller and the data
buffers, the coprocessor controller may control access of separate data streams in and out of the second
processor to the bus.

[0008] Particular benefit can be gained if there is a synchronisation mechanism for synchronising execution
of the coprocessor and of burst instructions with availability of data on which the coprocessor and the
burst instructions are to execute. This is particularly well accomplished if the coprocessor executes on the
basis of coprocessor instructions. An effective approach is for the synchronisation mechanism to be adapted
both to block execution of coprocessor instructions requiring execution of the second processor on data
which has not yet been loaded to the data buffers, and to block execution of burst instructions for storage
of data from the data buffers to the memory where such data has not been provided to the data buffers by the
second processor. A particularly effective way to implement the synchronisation mechanism is to use counters
which can be incremented or decremented through appropriate burst and coprocessor instructions, and which
block particular instructions if they cannot be decremented further.




EP 1 061 439 A1 

[0009] In a further aspect the invention provides a method of operating a computer system, comprising:
providing code for execution by a first processor; extraction from the code of a task to be carried out by a
second processor acting as coprocessor to the first processor; determining from the code and the task burst
instructions to allow data to be read from and written to a main memory in data bursts for access by the
second processor by means of at least one data buffer; and execution of the task on the coprocessor together
with execution of burst instructions by a burst controller controlling transfer of data between the at least
one data buffer and the main memory.
[0010] Advantageously, following extraction of the task from the code, coprocessor instructions for
execution by a coprocessor controller are determined to control execution of the task by the second
processor.
[0011] It is further advantageous if in execution of the task, synchronisation between execution of
coprocessor instructions and execution of burst instructions is achieved by a synchronisation mechanism.
This synchronisation mechanism may usefully comprise blocking of first instructions until second
instructions whose completion is necessary for correct execution of the first instructions have completed.
This mechanism may employ counters which can be incremented or decremented through appropriate burst or
coprocessor instructions.

BRIEF DESCRIPTION OF FIGURES

[0012] Specific embodiments of the invention will be described further below, by way of example, with
reference to the accompanying drawings, in which:

Figure 1 shows the basic elements of a system in accordance with a first embodiment of the invention;

Figure 2 shows the architecture of a burst buffers structure used in the system of Figure 1;

Figure 3 shows further features of the burst buffers structure of Figure 2;

Figure 4 shows the structure of a coprocessor controller used in the system of Figure 1 and its relationship
to other system components;

Figure 5 shows an example to illustrate a computational model usable on the system of Figure 1;

Figure 6 shows a timeline for computation and I/O operations for the example of Figure 5;

Figure 7 shows an annotated graph provided as output from the frontend of a toolchain useful to provide code
for the system of Figure 1;

Figure 8 shows a coprocessor internal configuration derived from the specifications in Figure 7;

Figure 9 shows the performance of alternative architectures for a 5x5 image convolution using 32 bit pixels;

Figure 10 shows the performance of the alternative architectures used to produce Figure 9 for a 5x5 image
convolution using 8 bit pixels;

Figures 11A and 11B show alternative pipeline architectures employing further embodiments of the present
invention;

Figure 12 shows two auxiliary processors usable as an alternative to the coprocessor instruction queue and
the burst instruction queue in the architecture of Figure 1; and

Figure 13 shows implementation of a state machine as an alternative to the coprocessor instruction queue in
the architecture of Figure 1.

DESCRIPTION OF SPECIFIC EMBODIMENTS

[0013] Figure 1 shows the basic elements of a system in accordance with a first embodiment of the invention.
Essentially, the system comprises a processor 1 and a coprocessor 2, established so that a calculation can
be partitioned between the processor 1 and the coprocessor 2 for greatest computational efficiency. The
processor 1 may be essentially any general purpose processor (for example, an i960) and the coprocessor 2
essentially any coprocessor capable of handling with significantly greater effectiveness a part of the
calculation. In the specific system described here, essentially the whole computation is to be handled by
the coprocessor 2, rather than by the processor 1 - however the invention is not limited to this specific
arrangement.

[0014] In the system specifically described, coprocessor 2 is a form of reconfigurable FPGA, as will be
discussed further below - however, other forms of coprocessor 2, such as, for example, ASICs and DSPs, could
be employed instead (with corresponding modifications to the computational model required). Both the
processor 1 and coprocessor 2 have access to a DRAM main memory 3, though the processor 1 also has access to
a cache of faster access memory 4, typically SRAM. Efficient access to the DRAM 3 is provided by "burst
buffer" memory 5 adapted to communicate with DRAM for the efficient loading and storing of "bursts" of
information - burst buffers will be described further below. Instructions to the burst buffers 5 are
provided through a burst instruction queue 6, and the burst buffers 5 operate under the control of a burst
buffer controller 7. The architecture of the burst buffers is mirrored, for reasons discussed below, in the
architecture associated with the coprocessor 2. Instructions to the coprocessor 2 are provided in a
coprocessor instruction queue 8, and the coprocessor operates under the control of a coprocessor
controller 9. Synchronisation of the operation of the burst buffers and the coprocessor and their associated
instruction queues is achieved by a specific mechanism, rather than in a general manner by processor 1
itself. In this embodiment, the mechanism comprises the load/execute semaphore 10 and the execute/store
semaphore 11, operating in a manner which will be described below (other such synchronisation mechanisms are
possible, as will also be discussed).

Description of Elements In System Architecture 



[0015] The individual elements of the system will now be discussed in more detail. The processor 1 generally
controls the computation, but in such a manner that some (or, in the embodiment described, all) of the steps
in the computation itself are carried out in the coprocessor 2. The processor 1 provides, through the burst
instruction queue 6, instructions for particular tasks: configuration of the burst buffer controller 7; and
transfer of data between the burst buffer memory 5 and the main memory 3. Furthermore, through the
coprocessor instruction queue 8, the processor 1 also provides instructions for further tasks: configuration
of the coprocessor controller 9; and initiation of a computation on coprocessor 2. This computation run on
coprocessor 2 accesses data through the burst buffer memory 5.
[0016] The use of the coprocessor instruction queue 8 effectively decouples the processor 1 from the
operation of coprocessor 2, and the use of the burst instruction queue 6 effectively decouples the
processor 1 from the burst buffers 5. This decoupling is discussed in greater detail below in the context of
the computational model for this embodiment of the invention.

[0017] The coprocessor 2 performs some or all of the actual computation. A particularly suitable coprocessor
is the CHESS FPGA structure, described in International Patent Application No. GB98/00262, International
Patent Application No. GB98/00248, US Patent Application No. 09/209,542, filed on 11 December 1998, and its
European equivalent European Patent Application No. 98309600.9, the contents of which applications are
incorporated by reference herein to the extent permissible by law. This coprocessor is reconfigurable, and
comprises a checkerboard array of 4-bit ALUs and switching structures, whereby the coprocessor is
configurable such that an output from one 4-bit ALU can be used to instruct another ALU. The CHESS
architecture is particularly effective for pipelined calculations, and is effectively adapted here to
interact with input and output data streams. The coprocessor controller 9 (whose operation will be discussed
further below) receives high-level control instructions (instructions for overall control of the
coprocessor 2, rather than instructions relating to detail of the calculation - e.g. "run for n cycles")
from the coprocessor instruction queue 8. The CHESS coprocessor 2 runs under the control of the coprocessor
controller 9 and receives and stores data through interaction with the burst buffers 5. The CHESS
coprocessor 2 thus acts on input streams to produce an output stream. This can be an efficient process
because the operation of the CHESS coprocessor is highly predictable. The detailed operation of computation
according to this model is discussed at a later point.
[0018] The processor 1 has access to a fast access memory cache 4 in SRAM in a conventional manner, but the
main memory is provided as DRAM 3. Effective access to the DRAM is provided by burst buffers 5. Burst buffers
have been described in European Patent Application No. 97309514.4 and corresponding US Patent Application
Serial No. 09/3,526, filed on 6 January 1998, which applications are incorporated by reference herein to the
extent permissible by law. The burst buffer architecture is described briefly herein, but for full details
of this architecture the reader is referred to these earlier applications.

[0019] The elements of the version of the burst buffers architecture (variants are available, as is
discussed in the aforementioned application) used in this embodiment are shown in Figures 2 and 3. A
connection 12 for allowing the burst buffers components to communicate with the processor 1 is provided.
Memory bus 16 provides a connection to the main memory 3 (not shown in Figure 2). This memory bus may be
shared with cache 4, in which case memory datapath arbiter 58 is adapted to allow communication to and from
cache 4 also.

[0020] The overall role of burst buffers in this arrangement is to allow computations to be performed on
coprocessor 2 involving transfer of data between this coprocessor 2 and main memory 3 in a way that both
maximises the efficiency of each system component and at the same time maximises the overall system
efficiency. This is achieved by a combination of several techniques:

burst accesses to DRAM, using the burst buffers 5 as described below;

simultaneous execution of computation on coprocessor 2 and data transfers between main memory 3 and burst
buffer memory 5, using a technique called "double buffering"; and

decoupling the execution of processor 1 from the execution of coprocessor 2 and burst buffer memory 5
through use of the instruction queues.

[0021] "Double buffering" is a technique known in, for example, computer graphics. In the form used here it
involves consuming - reading - data from one part of the burst buffer memory 5, while producing - writing -
other data into a different region of the same memory, with a switching mechanism to allow a region earlier
written to now to be read from, and vice-versa.
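
The switching idea can be illustrated with a minimal sketch. It assumes a software model with two equal regions held in Python lists; the class name `DoubleBuffer` and its methods are hypothetical and are not taken from the described hardware.

```python
class DoubleBuffer:
    """Two equal regions of one memory; one is read while the other is written."""

    def __init__(self, region_size):
        self.regions = [[0] * region_size, [0] * region_size]
        self.write_index = 0  # index of the region currently being written

    @property
    def write_region(self):
        return self.regions[self.write_index]

    @property
    def read_region(self):
        return self.regions[1 - self.write_index]

    def swap(self):
        """Switch roles: the region just written becomes the region to read."""
        self.write_index = 1 - self.write_index


buf = DoubleBuffer(region_size=4)
buf.write_region[:] = [1, 2, 3, 4]   # producer fills one region
buf.swap()                           # roles switch
consumed = list(buf.read_region)     # consumer reads what was just written
```

After the swap, the producer can fill the other region while the consumer works through the first.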

[0022] A particular benefit of burst buffers is effective utilisation of a feature of conventional DRAM
construction. A DRAM comprises an array of memory locations in a square matrix. To access an element in the
array, a row must first be selected (or 'opened'), followed by selection of the appropriate column. However,
once a row has been selected, successive accesses to columns in that row may be performed by just providing
the column address. The concept of opening a row and performing a sequence of accesses local to that row is
called a "burst". When data is arranged in a regular way, such as in media-intensive computations (typically
involving an algorithm employing a regular program loop which accesses long arrays without any data
dependent addressing), then effective use of bursts can dramatically increase computational speed. Burst
buffers are new memory structures adapted to access data from DRAM through efficient use of bursts.
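
A rough way to see why locality within a row matters is to count how often a new row has to be opened for a given access sequence. The sketch below is a simplification under stated assumptions: a single open row at a time and a hypothetical row size of 1024 bytes, neither of which is specified in the text.

```python
def count_row_opens(addresses, row_bytes):
    """Count row selections needed when only one DRAM row is open at a time."""
    opens = 0
    open_row = None
    for address in addresses:
        row = address // row_bytes
        if row != open_row:      # a different row must be opened first
            opens += 1
            open_row = row
    return opens


ROW_BYTES = 1024                              # hypothetical row size
sequential = list(range(0, 256, 4))           # 64 word accesses within one row
scattered = [0, 4096, 4, 4100]                # alternates between two rows

sequential_opens = count_row_opens(sequential, ROW_BYTES)
scattered_opens = count_row_opens(scattered, ROW_BYTES)
```

The regular sequence needs a single row selection for all 64 accesses, whereas the alternating sequence reopens a row on every access.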

[0023] A system may contain several burst buffers. Typically, each burst buffer is allocated to a respective
data stream. Since algorithms have a varying number of data streams, a fixed amount of SRAM 26 is available
to the burst buffers as a burst buffer memory area, and this amount is divided up according to the number of
buffers required. For example, if the amount of fixed SRAM is 2 Kbytes, and if an algorithm has four data
streams, the memory region might be partitioned into four 512 byte burst buffers.
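
The partitioning in this example can be sketched as follows, assuming an equal split and taking 2 Kbytes as 2048 bytes; the function name `partition_buffer_area` is hypothetical, and the text leaves the actual allocation policy to a higher-level software process.

```python
def partition_buffer_area(total_bytes, num_streams):
    """Split a fixed buffer area equally; returns (offset, size) per buffer."""
    if num_streams <= 0:
        raise ValueError("at least one data stream is required")
    size = total_bytes // num_streams
    return [(index * size, size) for index in range(num_streams)]


partitions = partition_buffer_area(total_bytes=2048, num_streams=4)
```

Each tuple gives the byte offset of a buffer within the area and its size.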

[0024] In architectures of this type, a burst comprises the set of addresses defined by:

$$\text{burst} = \{\, B + S \times i \mid B, S, i \in \mathbb{N} \wedge 0 \le i < L \,\}$$

where B is the base address of the transfer, S is the stride between elements, L is the length and
$\mathbb{N}$ is the set of natural numbers. Although not explicitly defined in this equation, the burst
order is defined by i incrementing from 0 to L-1. Thus, a burst may be defined by the 3-tuple of:
(base_address, length, stride)
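
As an illustration of this definition, the addresses of a burst can be enumerated in burst order from the 3-tuple. This is a minimal sketch; the function name `burst_addresses` and the example values are chosen only for illustration.

```python
def burst_addresses(base_address, length, stride):
    """Return base_address + stride * i for i = 0 .. length - 1, in burst order."""
    return [base_address + stride * i for i in range(length)]


addresses = burst_addresses(base_address=0x1000, length=4, stride=16)
```

Here the four addresses start at 0x1000 and are 16 bytes apart.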

[0025] In software, a burst may also be defined by the element size. This implies that a burst may be sized
in bytes, halfwords or words. The units of stride must take this into account. A "sized-burst" is defined by
a 4-tuple of the form:
(base_address, length, stride, size)

[0026] A "channel-burst" is a sized-burst where the size is the width of the channel to memory. The compiler
is responsible for the mapping of software sized-bursts into channel-bursts. The channel-burst may be defined
by the 4-tuple:
(base_address, length, stride, width)

[0027] If the channel width is 32 bits (or 4 bytes), the channel-burst is always of the form:
(base_address, length, stride, 4)
or abbreviated to the 3-tuple (base_address, length, stride).

[0028] The control of this memory and the allocation (and freeing) of burst buffers is handled at a higher
level by a software process. In the present embodiment, "double buffering" is used, but other strategies are
certainly possible - the decision involves a trade-off between storage efficiency and simplicity. The burst
buffer memory area 26 loads data from and stores data to the main memory 3 through memory datapath
arbiter 58, which operates under control of DMA controller 56, responsive to instructions received through
the burst instruction queue 6. Data is exchanged between the burst buffer memory area 26 and the processor 1
or the coprocessor 2 through the connection means 12. As shown in Figure 3, the control interface for the
burst buffers system 5 is based around a pair of tables: a Memory Access Table (MAT) 65 describing regions of
main memory for bursting to and from the burst buffer memory, and a Buffer Access Table (BAT) 66 describing
regions of burst buffer memory. In this embodiment, a homogeneous area of dual-port SRAM is used for the
burst buffer memory area 26.

[0029] A burst buffers arrangement which did not employ MATs and BATs (such as is also described in European
Patent Application No. 97309514.4) could be used in alternative embodiments of the present invention - the
parameters implicitly encoded in MATs and BATs (source address, destination address, length, stride) would
then have to be explicitly specified for every burst transfer issued. The main reason to use MATs and BATs,
rather than straightforward addresses, lengths and strides, is that this significantly reduces the overall
code size. In the context of the present invention, this is typically useful, rather than critical.

[0030] Burst instructions originating from the processor 1 are provided to the burst buffers 5 by means of a
burst instruction queue 6. Instructions from the burst instruction queue 6 are processed by a buffer control
element 54 to reference slots in the MAT 65 and the BAT 66. The buffer controller also receives control
inputs from eight burst control registers 52. Information contained in these two tables is bound together at
run time to describe a complete main-memory-to-burst-buffer transaction. Outputs are provided from the
buffer controller 54 to direct memory access (DMA) controller 56 and hence to the memory datapath arbiter 58
to effect transactions between the main memory 3 and the burst buffers memory area 26.

[0031] The key burst instructions are those used to load data from main memory 3 to the burst buffer memory
area 26, and to store data from the burst buffer memory area 26 to the main memory 3. These instructions are
"loadburst" and "storeburst". The loadburst instruction causes a burst of data words to be transferred from
a determined location in the memory 3 to that one of the burst buffers. There is also a corresponding
storeburst instruction, which causes a burst of data words to be transferred from that one of the burst
buffers to the memory 3, beginning at a specific address in the memory 3. For the architecture of Figure 1,
additional synchronisation instructions are also required - these are discussed further below.

[0032] The instructions loadburst and storeburst differ from normal load and store instructions in that they
complete in a single cycle, even though the transfer has not occurred. In essence, the loadburst and
storeburst instructions tell the memory interface 16 to perform the burst, but they do not wait for the
burst to complete.
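
The issue-without-waiting behaviour can be modelled in software as an enqueue that returns immediately, with the transfer completed later by a separate step. This is only a sketch under that assumption; the class `BurstIssueModel` and its methods are hypothetical and do not describe the hardware interface.

```python
from collections import deque


class BurstIssueModel:
    """Issuing a burst only records the request; the transfer happens later."""

    def __init__(self):
        self.pending = deque()   # bursts requested but not yet performed
        self.completed = []      # bursts whose transfer has finished

    def issue(self, opcode, mat_index, bat_index):
        """Return at once: no data has been moved when this call returns."""
        self.pending.append((opcode, mat_index, bat_index))

    def step(self):
        """Model the memory interface finishing the oldest pending burst."""
        if self.pending:
            self.completed.append(self.pending.popleft())


model = BurstIssueModel()
model.issue("loadburst", mat_index=0, bat_index=0)
model.issue("storeburst", mat_index=1, bat_index=1)
pending_after_issue = len(model.pending)   # both still outstanding
model.step()                               # first transfer completes later
```

Because the issuing side never waits, some separate mechanism is needed before it may rely on a transfer having finished.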

[0033] The fundamental operation is to issue an instruction which indexes to two table entries, one in each
of the memory access and buffer access tables. The index to the memory access table retrieves the base
address, extent and stride used at the memory end of the transfer. The index to the buffer access table
retrieves the base address within the burst buffer memory region. In the embodiment shown, masking and
offsets are provided to the index values by a context table (this is discussed further in European Patent
Application No. 97309514.4), although it is possible to use actual addresses instead. The direct memory
access (DMA) controller 56 is passed the parameters from the two tables and uses them to specify the
required transfer.
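
The binding of the two table entries into one transfer description can be sketched as below. The field names follow the text (memaddr, extent, stride, bufaddr, bufsize); the dataclasses, the function `bind_transfer` and the example values are assumptions made for illustration, and the context-table masking and offsets are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class MatEntry:
    memaddr: int   # start address in main memory
    extent: int    # address range covered by the transfer
    stride: int    # interval between successive elements


@dataclass
class BatEntry:
    bufaddr: int       # start offset within the burst buffer area
    bufsize: int = 0   # size used by the most recent transfer


def bind_transfer(mat, bat, mat_index, bat_index):
    """Combine one MAT entry and one BAT entry into the DMA parameters."""
    mat_entry = mat[mat_index]
    bat_entry = bat[bat_index]
    return {
        "memaddr": mat_entry.memaddr,
        "extent": mat_entry.extent,
        "stride": mat_entry.stride,
        "bufaddr": bat_entry.bufaddr,
    }


mat = [MatEntry(memaddr=0x2000, extent=64, stride=4)]
bat = [BatEntry(bufaddr=0x100)]
transfer = bind_transfer(mat, bat, mat_index=0, bat_index=0)
```

A burst instruction then only needs to carry the two small indices rather than the full set of addresses and strides.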

[0034] Table 1 shows a possible instruction set.

Opcode            Parameter value                       Comment

BB_LOADBURST      mat_index (integer),                  Load a burst of data into the burst buffer
                  bat_index (integer),                  memory from main memory, and optionally
                  block_increment (boolean)             increments the base address in main memory

BB_STOREBURST     mat_index (integer),                  Store a burst of data into main memory from
                  bat_index (integer),                  the burst buffer memory, and optionally
                  block_increment (boolean)             increments the base address in main memory

BB_LX_INCREMENT                                         Increment the value of the LX semaphore

BB_XS_DECREMENT   N/A                                   Decrement the value of the XS semaphore

BB_SET_MAT        entry (integer), memaddr (integer),   Sets a MAT entry to the desired values
                  extent (integer), stride (integer)

BB_SET_BAT        entry (integer), bufaddr (integer),   Sets a BAT entry to the desired values
                  extent (integer)

Table 1: Instruction set for burst buffers

[0035] The storeburst instruction (BB_STOREBURST) indexes parameters in the MAT and BAT, which define the
characteristics of the requested transfer. If the block_increment bit is set, the memaddr field of the
indexed entry in the MAT is automatically updated when the transfer completes (as is discussed below).

[0036] The loadburst instruction (BB_LOADBURST) also indexes parameters in the MAT and BAT, again which
define the characteristics of the required transfer. As before, if the block_increment bit is set, the
memaddr field of the indexed entry in the MAT is automatically updated when the transfer completes.

[0037] The synchronisation instructions needed are provided as Load-Execute Increment and eXecute-Store
Decrement (BB_LX_INCREMENT and BB_XS_DECREMENT). The purpose of BB_LX_INCREMENT is to make sure that the
execution of coprocessor 2 on a particular burst of data happens after the data needed has arrived into the
burst buffer memory 5 following a loadburst instruction. The purpose of BB_XS_DECREMENT is to make sure that
the execution of a storeburst instruction follows the completion of the calculation (on the coprocessor 2)
of the results that are to be stored back into main memory 3.
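
The ordering that these two instructions enforce can be illustrated with a small single-threaded model. It assumes counters that can always be incremented but refuse a decrement at zero; the class `SyncCounter` and the event log are hypothetical and only stand in for the hardware mechanism.

```python
class SyncCounter:
    """Counter that can always be incremented but blocks a decrement at zero."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

    def try_decrement(self):
        """Return True if decremented; False means the caller must wait."""
        if self.value == 0:
            return False
        self.value -= 1
        return True


lx = SyncCounter()   # load/execute: completed loads not yet consumed
xs = SyncCounter()   # execute/store: computed results not yet stored
log = []

started_early = lx.try_decrement()   # coprocessor asks before any load: refused
log.append("loadburst")
lx.increment()                       # BB_LX_INCREMENT follows the load
if lx.try_decrement():               # coprocessor may now execute
    log.append("execute")
    xs.increment()                   # results are ready to be stored
if xs.try_decrement():               # BB_XS_DECREMENT gates the store
    log.append("storeburst")
```

In this model a store can never be logged before the execution that produces its data, and execution can never be logged before its load.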

[0038] In this embodiment, the specific mechanism upon which these instructions act is a set of two counters
that track, respectively:

the number of regions in burst buffer memory 5 ready to receive a storeburst; and
the number of completed loadburst instructions.

[0039] Requests for data by the coprocessor 2 are performed by decrementing the LX counter, whereas the
availability of data is signalled by incrementing the XS counter. These counters have to satisfy two
properties: they must be
[...] embodiment, the BB_XS_DECREMENT instruction would act like a Wait() on the XS semaphore (11 in
Figure 1), whereas the BB_LX_INCREMENT instruction would act like a Signal() on the LX semaphore [...]. The
coprocessor controller 9 would, conversely, perform a Wait() on the LX semaphore [...] semaphore 11. The
semantics of these instructions can be the same as [...], although the arrangement of Signal() and Wait()
operations differs significantly from [...] paper. [...] instructions would be issued in the appropriate
sequence [...] relative temporal ordering of certain events [...]

The memory access table (MAT) 65 will now be described with reference to Figure 3. This is a memory
descriptor table holding information relating to main memory burst transactions. Each entry in the MAT is an
indexed slot describing a transaction to main memory. In this embodiment, the MAT 65 comprises 16 entries,
though different implementations are of course possible. Each entry comprises three fields:

1. Memory address (memaddr) - the start address of the relevant region in main memory. Ideally, this
location is in physical memory space, as virtual address translation may result in a burst request spanning
two physical pages, which would cause difficulties for the memory controller.

2. Extent (extent) - the length of the transfer multiplied by the stride [...]. The length of the transfer
is calculated by the division of the extent by the stride, and this is automatically copied to the bufsize
field of the related BAT 66 (see below) after a transfer has completed.

3. Stride (stride) - the interval between successive elements in a transfer.

memaddr: This is the 32 bit unsigned, word-aligned address of the first element of the channel burst.

extent: The parameter in the extent register is the address offset covering the range of the burst transfer.
If the transfer requires L elements separated by a stride of S, then the extent is S*L.

stride: The parameter stride is the number of bytes skipped between accesses. Values of the transfer stride
interval are restricted in the range of 1 to 1024. Values greater than 1024 are automatically truncated to
1024. Reads of this register return the value used for the burst (i.e. if truncation was necessary, then the
truncated value is returned). Also, strides must be multiples of the memory bus width, which in this case is
4 bytes. Automatic truncation (without rounding) is performed to enforce this alignment.
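
The stride rules and the extent relation above can be written out as a short sketch. The function names are hypothetical, and the order of the two truncations (limit first, then alignment) is an assumption, as the text does not state it.

```python
BUS_WIDTH_BYTES = 4   # memory bus width given in the text
MAX_STRIDE = 1024     # upper limit on the stride given in the text


def effective_stride(requested):
    """Apply the two truncations: cap at 1024, then align down to the bus width."""
    capped = min(requested, MAX_STRIDE)
    # Truncation without rounding: drop the remainder below a multiple of 4.
    # Requests below 4 would give 0 here; the text does not say how these are handled.
    return capped - (capped % BUS_WIDTH_BYTES)


def extent_for(length, stride):
    """Extent of a transfer of `length` elements separated by `stride` bytes (S*L)."""
    return stride * length


def length_from(extent, stride):
    """Transfer length recovered by dividing the extent by the stride."""
    return extent // stride


stride = effective_stride(18)            # truncated down to a multiple of 4
extent = extent_for(length=10, stride=stride)
length = length_from(extent, stride)
```

A read of the stride register would return the truncated value, which is what `effective_stride` models.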

[0044] An example of values contained by a MAT slot might be:
{0x1fee1bad, 128, 16}
which results in a 32 word (32 4 byte words) burst, with each word separated by 4 words (4 4 byte words).
[0045] The auto-increment indicator bit of a burst instruction also has relevance to the MAT 65. If this bit
is set in the burst instruction, the start address entry is increased to point to the next memory location
should the burst have continued past 32. This saves processor overhead in calculating the start address for
the next burst in a long sequence of memory accesses.
[0046] The buffer access table (BAT) 66 will now be described with reference to Figure 3. This is again a
memory descriptor table, in this case holding information relating to the burst buffer memory area 26. Each
entry in the BAT 66 describes a transaction to the burst buffer memory area 26. As for the MAT 65, the
BAT 66 comprises 16 entries, though this can of course be varied, as for the MAT 65. Each entry in this case
comprises two fields:

1. Buffer address (bufaddr) - the start of the buffer in the buffer area

2. Buffer size (bufsize) - the size of the buffer area used at the last transfer

[0047] The buffer address parameter bufaddr is the offset address for the first element of the channel-burst
in the buffer area. The burst buffer area is physically mapped by hardware into a region of the processor's
memory space. This means that the processor must use absolute addresses when accessing the burst buffer
area. However, DMA transfers simply use the offset, so it is necessary for hardware to manage any address
resolution required. Illegally aligned values may be automatically aligned by truncation. Reads of this
register return the value used for the burst (i.e. if truncation was necessary, then the truncated value is
returned). The default value is 0.

[0048] The parameter bufsize is the size of the region within the buffer area occupied by the most recent burst. This 
register is automatically set on the completion of a burst transfer, which targeted its entry. Note that the value stored is 
the burst length, since a value of 0 indicates an unused buffer entry. This register may be written, but this is only useful 
after a context switch when buffers are saved and restored. The default value is again 0.

[0049] Programming MAT and BAT entries is performed through the use of BB_SET_MAT and BB_SET_BAT instructions.
The entry parameter determines the entry in the MAT (or BAT) to which the current instruction refers.
[0050] Further details of the burst buffer architecture and the mechanisms for its control are provided in European 
Patent Application No. 97309514.4 and the corresponding US Patent Application Serial No. 09/3.526. The details 
provided above are primarily intended to show the architectural elements of the burst buffer system, and to show the
functional effect that the burst buffer system can accomplish, together with the inputs and outputs that it provides. The
burst buffer system is optimally adapted for a particular type of computational model, which is developed here into a
computational model for the described embodiment of the present invention. This computational model is described
further below.

[0051] The burst instruction queue 6 has been described above. A significant aspect of the embodiment is that
instructions are similarly provided to the coprocessor through a coprocessor instruction queue 8. The coprocessor
instruction queue 8 operates in connection with the coprocessor controller 9, which determines how the coprocessor
receives instructions from the processor 1 and how it exchanges data with the burst buffer system 5.
[0052] Use of the coprocessor instruction queue 8 has the important effect that the processor 1 is decoupled
from the calculation itself. During the calculation, processor resources are thus available for the execution of other
tasks. The only situation which could lead to operation of processor 1 being stalled is that one of the instruction queues
6, 8 is full of instructions. This case can arise when processor 1 produces instructions for either queue at a rate faster
than that at which instructions are consumed. Solutions to this problem are available. Effectiveness can be improved
by requiring the processor 1 to perform a context switch and return to service these two queues after a predefined
amount of time, or upon receipt of an interrupt triggered by the fact that the number of slots occupied in either queue
has decreased to a predefined amount. Conversely, if one of the two queues becomes empty because the processor
1 cannot keep up with the rate at which instructions are consumed, the consumer of those instructions (the coprocessor
controller 9 or the burst buffer controller 7) will stall until new instructions are produced by the processor 1.
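The stall conditions just described are those of a bounded first-in, first-out queue. The following C sketch models one such queue in software; the depth of 8, the type and function names and the refill threshold check are assumptions made for illustration, since the text fixes none of them.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u  /* assumed depth; the text does not specify one */

struct instr_queue {
    uint32_t slots[QUEUE_DEPTH];
    unsigned head;   /* index of the oldest queued instruction */
    unsigned count;  /* number of occupied slots */
};

/* Producer side (processor 1): returns false when the queue is full,
 * the one case in which the processor would have to stall or switch
 * to other work. */
static bool queue_push(struct instr_queue *q, uint32_t instr)
{
    if (q->count == QUEUE_DEPTH)
        return false;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = instr;
    q->count++;
    return true;
}

/* Consumer side (a controller): returns false when the queue is empty,
 * in which case the consumer stalls until new instructions arrive. */
static bool queue_pop(struct instr_queue *q, uint32_t *instr)
{
    if (q->count == 0u)
        return false;
    *instr = q->slots[q->head];
    q->head = (q->head + 1u) % QUEUE_DEPTH;
    q->count--;
    return true;
}

/* True once occupancy has dropped to the given threshold, the point at
 * which an interrupt could call the processor back to refill the queue. */
static bool queue_wants_refill(const struct instr_queue *q, unsigned threshold)
{
    return q->count <= threshold;
}
```

A processor loop built on this model would call `queue_push` until it returns false, switch to other work, and return when `queue_wants_refill` becomes true.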
[0053] Modifications can also be provided to the architecture which ensure that no further involvement from the 
processor 1 is required at all, and these will be discussed in the final part of this specification. 

[0054] The basic functions of the coprocessor controller 9 are to fetch data from the burst buffer memory 5 to the
coprocessor 2 (and vice versa), to control the activity of the coprocessor, and to synchronise the execution of the
coprocessor 2 with the appropriate loads from, or stores to, the burst buffer memory 5. To achieve these functions, the
coprocessor controller may be in essence a relatively simple state machine able to generate addresses according to
certain rules.

[0055] Figure 4 shows the coprocessor controller 9 in its relationship to the other components of the architecture,
and also shows its constituent elements and its connections with other elements in the overall architecture. Its exact
function depends on the type of inputs and outputs required by the coprocessor 2 and its initialisation requirements (if
any), and so may vary in detail from that described below. In the case of a CHESS coprocessor, these inputs and
outputs are input and output data streams exchanged with the burst buffer memory 5.

[0056] Coprocessor controller 9 performs two main tasks:

control of the communication between the coprocessor 2 and the burst buffer memory 5; and
maintenance of a system state through the use of a control finite state machine 42.

[0057] The coprocessor 2 accesses data in streams, each of which is given an association with one of a number of
control registers 41. Addresses for these registers 41 are generated in a periodic fashion by control finite state machine
42 with addressing logic 43, according to a sequence generated by the finite state machine 42.
[0058] At every tick of a clock within the finite state machine 42, the finite state machine gives permission for (at
most) one of the registers 41 to have an address generated for it and the address used to allow the register 41 to [...].
An appropriate control signal is generated by the finite state machine [...] multiplexer [...] so the appropriate address
is sent to the burst buffer memory 5, together [...] with each register 41. [...]
[0059] After an address obtained for a register 41 has been used to address memory, a constant quantity is added [...]
5. That is, if the width of this connection is 4 bytes, then the increment made to counter 41 will be 4. This is essentially
comparable to "stride" in the programming of burst buffers.

[0060] The coprocessor controller mechanism described above allows the multiplexing of different data streams
[...] the single shared bus through its own port. [...]
end of the bus, the coprocessor 2 is ready to read from and write to this bus in a synchronous manner. It is the
responsibility [...] that:

no two streams try to access the bus at the same time; and that

the execution of coprocessor 2 is synchronous with the data transfer to and from burst buffer memory 5.

[0062] This latter requirement ensures that the coprocessor 2 is ready to read the data placed by the burst buffers
memory 5 on the connection between the two devices, and vice versa.

[0063] [...] directly provided between the Chess array 2 and the burst [...], the need for multiplexing would still remain.
Unless the number of physical connections between the coprocessor 2 and the burst buffer memory 5 is greater than
or equal to the total number of logical [...], [...] logical streams [...] be multiplexed on the [...]. Technological [...]
related to the design of fast SRAM (as is advantageously used for the burst buffer memory 5) discourage the use of
more than one connection with the coprocessor 2.

[0064] [...] coprocessor controller [...] to the CHESS array comprising coprocessor [...] a specified number of [...].
This is achieved by the counter in the control finite state machine 42 ticking for the specified number of cycles before
"freezing" the CHESS array by "gating" (that is, stopping) its internal clock, in a way that does not affect the internal
state of the pipelines in the coprocessor 2. This number of ticks is specified using the CC_START_EXEC instruction,
described below.

[0065] Coprocessor controller 9 is programmed by processor 1 through the use of the coprocessor instruction queue
8. A possible instruction set for this coprocessor controller 9 is shown in Table 2 below.



Opcode                Parameter Value        Comment
--------------------  ---------------------  ---------------------------------------------------
CC_CURRENT_PORT       n (integer)            Port # the next CC_PORT_xxx commands will refer to
CC_PORT_PERIOD        (integer)              Period of activity of a port
CC_PORT_PHASE_START   start (integer)        Phase start of the activity of a port
CC_PORT_PHASE_END     end (integer)          Phase end of the activity of a port
CC_PORT_TIME_START    t_start (integer)      Start cycle of the activity of a port
CC_PORT_TIME_END      t_end (integer)        End cycle of the activity of a port
CC_PORT_ADDRESS       addr_start (integer)   Initial address for a port
CC_PORT_INCREMENT     addr_incr (integer)    Address increment for a port
CC_PORT_IS_WRITE      rw (boolean)           Read/Write flag
CC_START_EXEC         n_cycles (integer)     Start/Resume the execution of coprocessor 2 for a
                                             determined # of cycles
CC_LX_DECREMENT       N/A                    Decrement the value of the LX semaphore
CC_XS_INCREMENT       N/A                    Increment the value of the XS semaphore

Table 2: Coprocessor controller instruction set



[0066] For the aforementioned instructions, different choices of instruction format could be made. One possible format
is a 32-bit number, in which 16 bits encode the opcode, and 16 bits encode the optional parameter value described
above.
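A 32-bit format of this kind can be sketched in C as follows. The text does not say which half of the word carries the opcode, nor what numeric value each opcode has, so the placement of the opcode in the upper 16 bits and the opcode numbers below are assumptions chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical opcode numbers; the text names the opcodes but gives
 * no numeric values for them. */
enum cc_opcode {
    CC_CURRENT_PORT = 1,
    CC_PORT_PERIOD  = 2,
    CC_START_EXEC   = 10
};

/* Pack an opcode and its parameter into one 32-bit instruction word.
 * Assumed layout: opcode in bits 31..16, parameter in bits 15..0. */
static uint32_t cc_encode(uint16_t opcode, uint16_t param)
{
    return ((uint32_t)opcode << 16) | (uint32_t)param;
}

/* Extract the opcode field from an instruction word. */
static uint16_t cc_opcode_of(uint32_t word)
{
    return (uint16_t)(word >> 16);
}

/* Extract the parameter field from an instruction word. */
static uint16_t cc_param_of(uint32_t word)
{
    return (uint16_t)(word & 0xFFFFu);
}
```

One consequence of a 16-bit parameter field is that a single instruction can carry a value no larger than 65535, which bounds, for example, the cycle count passed to one CC_START_EXEC under this particular layout.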

[0067] The semantics of individual instructions are as follows: 

• CC_CURRENT_PORT selects one of the ports as the recipient of all the following CC_PORT_xxx instructions,
until the next CC_CURRENT_PORT

• CC_PORT_PERIOD sets the period of activation of the current port to the value of the integer parameter

• CC_PORT_PHASE_START/CC_PORT_PHASE_END (start, end) set the start/end of the activation phase of the
current port to the value of the integer parameter (start, end)

• CC_PORT_TIME_START/CC_PORT_TIME_END (t_start, t_end) set the first/last cycle of activity of the current port

• CC_PORT_ADDRESS (addr_start) sets the current address of the current port to the value of the integer parameter
addr_start

• CC_PORT_INCREMENT (addr_incr) sets the address increment of the current port to the value of the integer
parameter addr_incr

• CC_PORT_IS_WRITE (rw) sets the data transfer direction for the current port to the value of the Boolean parameter
rw

• CC_START_EXEC (n_cycles) initiates the execution of coprocessor 2 for a number of clock cycles specified
by the associated integer parameter n_cycles

• CC_LXS_DECREMENT decrements (in a suspensive manner, as previously described) the value of the LX
semaphore;

• CC_XSS_INCREMENT increments the value of the XS semaphore.

[0068] A port is active (that is, in control of communication with the burst buffer memory 5) if the [...] (current time
mod period) [...]. This allows the possibility of [...] two streams with the same period, say 5, such that one has control
of the BB memory for the first 4 cycles, and the other has control for the remaining cycle.
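The port parameters set by the instructions above suggest a simple activity test, sketched here in C. The exact condition is not legible in this text, so the rule used below is an assumption: a port is taken to be active at tick t when t lies in the half-open interval from its start time to its end time and t modulo the period lies in the half-open interval from phase start to phase end. The struct and function names are likewise hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the per-port state. */
struct cc_port {
    uint32_t period;
    uint32_t phase_start, phase_end;  /* assumed half-open: [start, end) */
    uint32_t time_start, time_end;    /* assumed half-open: [start, end) */
    uint32_t address, increment;
    bool     is_write;
};

/* Assumed activity rule: inside the time window and inside the phase
 * window of the current period. */
static bool port_active(const struct cc_port *p, uint32_t t)
{
    uint32_t phase;
    if (p->period == 0u || t < p->time_start || t >= p->time_end)
        return false;
    phase = t % p->period;
    return phase >= p->phase_start && phase < p->phase_end;
}

/* If the port is active at tick t, report the address it uses and
 * advance that address by the port's increment. */
static bool port_tick(struct cc_port *p, uint32_t t, uint32_t *addr)
{
    if (!port_active(p, t))
        return false;
    *addr = p->address;
    p->address += p->increment;
    return true;
}
```

Under this rule, two ports with period 5 and phase windows 0 to 4 and 4 to 5 never claim the same tick: the first owns four cycles of every five and the second owns the remaining one.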

[0069] The execution of an algorithm using this architecture involves first the programming of the coprocessor 2,
then the programming or initialisation of the coprocessor controller 9 and the burst buffer controller 7, followed by
the actual execution of the algorithm.

[0070] For the programming of the coprocessor 2, it will generally be most straightforward for the configuration to be
loaded into the coprocessor itself by means specific to the actual embodiment of the device.
[0071] For the programming of the coprocessor controller 9, the steps are as follows:

1. The main coprocessor controller 9 is configured according to the total number, periods, phases and address
increments for every logical stream present in the Chess array, as described before. An example of the
programming of the coprocessor controller 9 to perform the desired functions is provided below.

2. The next step in the configuration of coprocessor controller 9 is address configuration. Although it is likely that
the characteristics (period, phase) of every logical stream will remain the same throughout an algorithm, the actual
addresses accessed by the coprocessor controller 9 in the burst buffers memory 5 will vary. It is this variability
which allows the burst buffers controller 7 to perform double-buffering in a straightforward manner within the burst
buffers architecture. The effect of this double-buffering, as previously stated, is to give the coprocessor 2 the
impression that it is interacting with continuous streams, whereas in fact buffers are being switched continuously.

[0072] The burst buffers controller 7 also needs to be configured. To do this, the appropriate commands have to be
sent to the burst instruction queue 6 in order to configure the transfers of data to and from main memory 3 into the
burst buffers memory 5. These instructions (BB_SET_MAT and BB_SET_BAT) configure the appropriate entries within
the BAT and the MAT in a manner consistent with the programming of the coprocessor controller 9. In this embodiment,
the instructions to program the MAT and the BAT entries are issued through the burst instruction queue 6. An alternative
possibility would be the use of memory-mapped registers which the processor 1 would write to and read from. As in
the present embodiment there is no possibility of reading from memory-mapped registers (as they are not present),
processor 1 cannot query the state of the burst buffer controller 7; however, this is not a significant limitation.
Furthermore, the use of the burst instruction queue 6 for this purpose allows the possibility of interleaving instructions
to configure MAT and BAT entries with the execution of burst transfers, thus maintaining correct temporal semantics
without the supervision of the processor 1.

[0073] After these steps have been performed, the actual execution of the CHESS array can be started. It is necessary
in this embodiment only to instruct the CHESS array to run for a specified number of cycles. This is achieved by writing
the exact number of cycles as a parameter to a CC_START_EXEC instruction in the coprocessor instruction queue 8,
so that this data can then be passed to the coprocessor controller 9. One clock cycle after this value has been transferred
into coprocessor controller 9, the controller starts transferring values between the burst buffer memory 5 and the CHESS
array of coprocessor 2, and enables the execution of the CHESS array.

[0074] An important step must however be added before instructions relating to the computation are placed in the
respective instruction queues. This is to ensure the necessary synchronisation mechanisms are in place to implement
successfully the synchronisation and double-buffering principles. The basic element in this mechanism is that the
coprocessor controller 9 will try to decrement the value of the LX semaphore and will suspend coprocessor operation
until it can do so, according to the logic described above. The initial value of this semaphore is 0: the coprocessor
controller 9 and the coprocessor 2 are hence "frozen" at this stage. Only when the value of the LX semaphore is
incremented by the burst buffers controller 7 after a successful loadburst instruction will the coprocessor 2 be able to
start (or resume) its execution. To achieve this effect, a CC_LX_DECREMENT instruction is inserted in the coprocessor
instruction queue 8 before the "start coprocessor 2 execution" (CC_START_EXEC) instruction. As will be shown, a
corresponding "increment the LX semaphore" (BB_LX_INCREMENT) instruction will be inserted in the burst instruction
queue 6 after the corresponding loadburst instruction.
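The semaphore behaviour described here can be modelled in a few lines of C. This is a single-threaded software sketch, not the hardware mechanism: the suspensive decrement is represented by a function that reports failure when the caller would have to wait, and the type and function names are assumptions.

```c
#include <stdbool.h>

/* Hypothetical model of a counting semaphore such as LX or XS. */
struct bb_semaphore {
    unsigned value;  /* initial value is 0, as stated in the text */
};

/* Increment side, e.g. the burst buffers controller after a loadburst. */
static void bb_sem_increment(struct bb_semaphore *s)
{
    s->value++;
}

/* Decrement side, e.g. the coprocessor controller before starting
 * execution. Returns false when the value is 0, meaning the caller
 * stays suspended and must retry later. */
static bool bb_sem_try_decrement(struct bb_semaphore *s)
{
    if (s->value == 0u)
        return false;
    s->value--;
    return true;
}
```

In this model the coprocessor controller would retry `bb_sem_try_decrement` on the LX semaphore at each tick and would only move on to the following CC_START_EXEC once the call succeeds, that is, once a matching increment has been issued after the loadburst.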

[0075] The actual transfer of data between CHESS logical streams and the burst buffer memory 5 is carried out in
accordance with the programming of the coprocessor controller 9 as previously described.
[0076] The number of ticks for which the counter 42 has to run depends on how long it takes to consume one or
more input bursts. It is left to the application software to ensure the correctness of the system. The programming of
the counter 42 must be such that, once a buffer has been consumed, the execution of coprocessor 2 will stop. The
next instruction in the coprocessor instruction queue 8 must be a synchronisation instruction (that is, a
CC_LX_DECREMENT), in order to ensure that the next burst of data has arrived into the burst buffers memory 5.
Following this instruction (and, possibly, a waiting period until the data required is available), the initial address of this
new burst of data is assigned to the data stream (with a CC_PORT_ADDRESS instruction), and execution is resumed
via a CC_START_EXEC instruction. The procedure is similar for output streams (with the important difference that
there will be no waiting period equivalent to that required for data to arrive from main memory 3 into burst buffers
memory 5).

Computational Model 



[0077] An illustration of the overall computation model will now be described, with reference to Figure 5. The
illustration indicates how an algorithm can be recoded for use in this architecture, using as an example a simple vector
addition, which can be coded in C for a conventional microprocessor as:



int a[1024], b[1024], c[1024];
int i;
for (i = 0; i < 1024; i++)
    a[i] = b[i] + c[i];



[0078] A piece of C code to run processor 1 which achieves on the architecture of Figure 1 the same functionality
as the original vector addition loop nest is as follows:



0:  int a[1024], b[1024], c[1024];
1:  int eo, not_eo, k;
2:  /*Port 0 specification: port #, increment, xfer size, period,
3:    phase start, phase end, start time, end time, r/w*/
4:  CIQ_STREAM( 0, 4, 4, 3, 0, 1, 0, 3*BLEN*MAXK+3, 0 );
5:  /*Port 1 specification*/
6:  CIQ_STREAM( 1, 4, 4, 3, 1, 2, 0, 3*BLEN*MAXK+3, 0 );
7:  /*Port 2 specification*/
8:  CIQ_STREAM( 2, 4, 4, 3, 2, 3, 0, 3*BLEN*MAXK+3, 1 );
9:  BIQ_SET_MAT(0, &b[0], BLEN*4, 4);
10: BIQ_SET_MAT(1, &c[0], BLEN*4, 4);
11: BIQ_SET_MAT(2, &a[0], BLEN*4, 4);
12: BIQ_SET_BAT(0, 0x0000, BLEN*4); BIQ_SET_BAT(1, 0x0100, BLEN*4);
13: BIQ_SET_BAT(2, 0x0200, BLEN*4); BIQ_SET_BAT(3, 0x0300, BLEN*4);
14: BIQ_SET_BAT(4, 0x0400, BLEN*4); BIQ_SET_BAT(5, 0x0500, BLEN*4);
15: for( k = 0; k < MAXK; k++ )
16: {
17:   /*Even or odd iteration? For double buffering*/
18:   eo = k&0x1;
19:   CIQ_LXD(2);
20:   CIQ_SA(0, (BLEN*4*eo));
21:   CIQ_SA(1, ((2*BLEN*4)+BLEN*4*eo));
22:   CIQ_SA(2, ((4*BLEN*4)+BLEN*4*eo));
23:   /*Start Chess*/
24:   CIQ_ST(3*BLEN);
25:   CIQ_XSI(1);
26:   /*BB stuff*/
27:   /*Load A*/
28:   BIQ_FLB(0,eo);
29:   /*Load B*/
30:   BIQ_FLB(2,2+eo);
31:   BIQ_LXI(2);
32:   if( k >= 1 )
33:   {
34:     not_eo = (eo==0)?1:0;
35:     BIQ_XSD(1);
36:     BIQ_FSB(4, 4+not_eo);
37:   }
38: }
39: eo = MAXK & 0x1;
40: not_eo = (eo==0)?1:0;
41: BIQ_XSD(1);
42: BIQ_FSB(4, 4+not_eo);



[0079] In this arrangement, three ports are used in coprocessor controller 9: one for each input vector (b and c) and
one for the output vector (a). The statements at lines 4, 6 and 8 are code macros to initialise these ports. These
macros expand into coprocessor controller instructions; for port 0 (line 4), the expansion is:

CC_CURRENT_PORT(0);
CC_PORT_INCREMENT(4);
CC_TRANSFER_SIZE(4);
CC_PORT_PERIOD(3);
CC_PORT_PHASE_START(0);
CC_PORT_PHASE_END(1);
CC_PORT_START_TIME(0);
CC_PORT_END_TIME(3*BLEN*MAXK+3);
CC_PORT_IS_WRITE(0);

[0080] [...] and precisely at ticks [...], and increment the address it reads from by 4 bytes each time. BLEN*MAXK is
the length of the [...] sum (in this case, 1024), and BLEN is the length of a single burst of data from DRAM (say
64 bytes). With these values, MAXK will be set to 1024/64 = 16.

[0081] [...] tables to addresses in main memory 3 and burst buffers memory 5. The command BIQ_SET_MAT(0, &b[0],
BLEN*4, 4, TRUE) is a code macro that is expanded into BB_SET_MAT(0, &b[0], BLEN*4, 4) and ties entry 0 in the MAT
to address &b[0], sets the burst length to be BLEN*4 bytes (that is, BLEN integers, if an integer is 32 bits) and the
stride to 4. The two lines that follow are similar and relate to c and a. The line BIQ_SET_BAT(0, 0x0000, BLEN*4) is
expanded to BB_SET_BAT(0, 0x0000, BLEN*4) and ties entry 0 of the BAT to address 0x0000 in the burst buffers
memory 5. The two lines that follow are again similar.

[0082] Up to this point, no computation has taken place; however, coprocessor controller 9 and burst buffers controller
7 have been set up. The loop nest at lines 15 to 38 is where the actual computation takes place. This loop is repeated
MAXK times, and each iteration operates on BLEN elements, giving a total of MAXK*BLEN elements processed. The
loop starts with a set of instructions CIQ_xxx sent to the coprocessor instruction queue 8 to control the activity of the
coprocessor 2 and coprocessor controller 9, followed by a set of instructions sent to the burst instruction queue 6
whose purpose is to control the burst buffers controller 7 and the burst buffers memory 5. The relative order of these
two sets is in principle unimportant, because the synchronisation between the different system elements is guaranteed
explicitly by the semaphores. It would even be possible to have two distinct loops running after each other (provided
that the two instruction queues were deep enough), or to have two distinct threads of control.

[0083] The CIQ_xxx lines are code macros that simplify the writing of the source code. Their meaning is the following:

CIQ_LXD(N) inserts N CC_LXS_DECREMENT instructions in the coprocessor instruction queue 8;
CIQ_SA(port, address) inserts a CC_CURRENT_PORT(port) and a CC_PORT_ADDRESS(address) instruction
in the coprocessor instruction queue 8;
CIQ_ST(cycleno) inserts a CC_EXECUTE_START(cycleno) instruction in order to let the coprocessor 2 execute
for cycleno ticks of counter 42; and
CIQ_XSI(N) inserts N CC_XSS_INCREMENT instructions in the coprocessor instruction queue 8.

[0084] The net effect of the code shown above is to:

synchronise with a corresponding loadburst on the LXS semaphore;

start the computation on coprocessor 2 for 3*BLEN ticks of counter 42; and

synchronise with a corresponding storeburst on the XSS semaphore.

[0085] The BIQ_xxx lines are again code macros that simplify the writing of the source code. Their meaning is as
follows:

BIQ_FLB(mate,bate) inserts a BB_LOADBURST(mate, bate, TRUE) instruction into the burst instruction queue 6;
BIQ_LXI(N) inserts N BB_LX_INCREMENT instructions in the burst instruction queue 6;
BIQ_FSB(mate,bate) inserts a BB_STOREBURST(mate, bate, TRUE) instruction into the burst instruction queue
6; and
BIQ_XSD(N) inserts N BB_XS_DECREMENT instructions in the burst instruction queue 6.

[0086] The net effect of the code shown above is to load two bursts from main DRAM memory 3 into burst buffers
memory 5, and then to increase the value of the LX semaphore 10 so that the coprocessor 2 can start its execution
as described above. In all iterations but the first one, the results of the computation of coprocessor 2 are then stored
back into main memory 3 using a storeburst instruction. It is not strictly necessary to wait for the second iteration to
store the result of the computation executed in the first iteration, but this enhances the parallelism between the
coprocessor 2 and the burst buffers memory 5.
[0087] The use of the two variables eo and not_eo is a mechanism used here to allow the double-buffering effect
described previously.
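The index arithmetic behind this double-buffering can be isolated in a short C sketch. The function names are hypothetical; the expressions themselves are taken from the listing (the eo parity, the offsets passed to CIQ_SA, and the 4+not_eo entry passed to BIQ_FSB). The check against the BAT addresses assumes BLEN is 64, so that BLEN*4 equals the 0x100 spacing used in the BIQ_SET_BAT lines.

```c
/* Parity of the iteration: 0 for even k, 1 for odd k (eo in the listing). */
static int parity(int k)
{
    return k & 0x1;
}

/* Byte offset in burst buffer memory of the buffer a port uses in
 * iteration k. first_entry is the BAT entry used on even iterations
 * (0, 2 or 4 in the listing); blen is BLEN. */
static int port_offset(int first_entry, int k, int blen)
{
    return first_entry * blen * 4 + blen * 4 * parity(k);
}

/* BAT entry that receives the coprocessor's output in iteration k. */
static int output_entry(int k)
{
    return 4 + parity(k);
}

/* BAT entry written back by the storeburst issued in iteration k
 * (4+not_eo in the listing). */
static int store_entry(int k)
{
    return 4 + (parity(k) == 0 ? 1 : 0);
}
```

For every k from 1 upwards, `store_entry(k)` equals `output_entry(k - 1)`: each storeburst drains the buffer filled in the previous iteration while the coprocessor writes into the other one. The same relation holds for the final store issued after the loop, with k equal to MAXK.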

[0088] Lines 39 to 42 perform the last burst transfer to main memory 3 from burst buffers memory 5, compensating
for the absence of a storeburst instruction in the first iteration of the loop body.

[0089] The resulting timeline is as shown in Figure 6. Loadbursts 601 are the first activity (as until these are completed
the coprocessor 2 is stalled by the load/execute semaphore), and when these are completed the coprocessor 2 can
begin to execute 602. The next instruction in the burst instruction queue 6 is another loadburst 604, which is carried
out as soon as the first two loads have finished. Then, the next instruction in the burst instruction queue 6 is a storeburst
603, which has to wait until the XS semaphore 11 signals that the first computation on coprocessor 2 has completed.
This process continues throughout the loop.
[0090] Although the example indicated above is for a very simple algorithm, it illustrates the basic principles required
in calculations that are more complex. The person skilled in the art could use the approach, principles and techniques
indicated above for programming the architecture of Figure 1 to adapt more complex algorithms for execution by this
architecture.

Tool chain for computation

[0091] This architecture of the computation model can be exploited in straightforward fashion by hand coding - that is,
manually writing C code to run on the CPU, adapted in conventional manner to schedule the appropriate operation of
[...] to the appropriate queues, and to set the system components [...]
configuration for the coprocessor in accordance with the standard synthesis tools for configuring that coprocessor.
For a configurable or FPGA-based processor like CHESS, this tool will generally be a hardware description language.
An appropriate hardware description language to use for CHESS is JHDL, described in "JHDL - An HDL for
Reconfigurable Systems" by Peter Bellows and Brad Hutchings, Proceedings of the IEEE Symposium on
Field-Programmable Custom Computing Machines, April 1998.

[0092] [...] such a toolchain [...] to this computational architecture. The elements of such a toolchain and its practical
operation are described briefly below.

[0093] [...] function of converting conventional sequential code to code adapted specifically for [...]:

a CHESS coprocessor configuration for execution of the computation;
a burst buffer schedule for moving data between the system memory and the burst buffer memory; and
a coprocessor controller configuration for moving data between the CHESS coprocessor and the burst buffer memory.

[0094] The toolchain itself has two components. The first is a frontend, which takes C code as its input and provides
annotated dependence graphs as its output. The second component is a backend, which takes the dependence graphs
generated by the frontend, and produces from these the CHESS configuration, the burst buffers schedule and the
coprocessor controller configuration.

[0095] The main task of the frontend is to generate a graph which aptly describes the computation as it is to happen
in coprocessor 2. One of the main steps performed is value-based dependence analysis, as described in W. Pugh and
D. Wonnacott, "An Exact Method for Analysis of Value-based Array Data Dependences", University of Maryland
Institute for Advanced Computer Studies - Dept. of Computer Science, University of Maryland, December 1993. The
output generated is a description of the dataflow to be implemented in the CHESS array and a representation of all
the addresses that need to be loaded in as inputs (via loadburst instructions) or stored to as outputs (via storeburst
instructions), and of the order in which data has to be retrieved from or stored to the main memory 3. This is the basis
upon which an efficient schedule for the burst buffers controller 7 will be derived.
[0096] If we assume, as an example, the C code for a 4-tap FIR filter

int i, j, src[], kernel[], dst[];
for( i = 0; i < 1000; i++ )
  for( j = 0; j < 4; j++ )
    dst[i] = dst[i] + src[4+i-j]*kernel[j];

as the input to the frontend, the output, provided as a text file, will have the following form:

loop: 0<=i<999                            #loop nest description
loop: 0<=j<4

16:str/0/0/20/                            #store instruction
LOD:
#Array:d[1/0/0] at line 11
20:ldc/16/0/0/                            #load constant
22:str/0/0/26/                            #store instruction, which
LOD: 4 <= j                               #writes its outputs to main
#Array:d[1/0/0] at line 13                #memory if 4<=j
26:add/22/27/31/                          #addition
27:lod/26/0/0/                            #load instruction, taking its inputs
Dep(16): [0][0] / Range: j <= 0           #from instruction 16 if j<=0
Dep(22): [0][1] / Range: 1 <= j           #from instruction 22 otherwise
LID:
#Array:d[1/0/0] at line 13
31:mul/26/32/37/                          #multiplication
32:lod/31/0/0/                            #load instruction
Dep(32): [1][1] / Range: 1 <= i && 1 <= j
LID: i <= 0 || j <= 0 && 1 <= i           #which takes its inputs from main
#Array:src[1/-1/0] at line 13             #memory if i <= 0 || j <= 0 && 1 <= i
37:lod/31/0/0/
Dep(37): [1][0] / Range: 1 <= i           #load instruction
LID: i <= 0                               #taking its inputs from main memory if
#Array:kernel[0/1/0] at line 13           #i<=0

[0097] This text file is a representation of an annotated graph. The graph itself is shown in Figure 7. The graph clearly
shows the dependencies found by the frontend algorithm. Edges 81 are marked with the condition under which a
dependence exists, and the dependence distance where applicable. The description provided contains all the
information necessary to generate a hardware component with the required functionality.

[0098] The backend of the compilation toolchain has certain basic functions. One is to schedule and retime the
extended dependence graph obtained from the frontend. This is necessary to obtain a fully functional CHESS
configuration. Scheduling involves determining a point in time for each of the nodes 82 in the extended dependence graph
to be activated, and retiming involves, for example, the insertion of delays to ensure that edges propagate values at
the appropriate moment. Scheduling can be performed using shifted-linear scheduling, a technique widely used in
hardware synthesis. Retiming is a common and quite straightforward task in hardware synthesis, and merely involves
adding an appropriate number of registers to the circuit so that different paths in the circuit meet at the appropriate
point in time. At this point, we have a complete description of the functionality of the coprocessor 2 (here, a CHESS
coprocessor). This description is shown in Figure 8. This description can then be passed on to the appropriate tools
to generate the sequence of signals (commonly referred to as "bitstream") necessary to program the CHESS
coprocessor with this functionality.
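The register-insertion idea behind retiming can be illustrated for the simplest case, a node fed by two paths. The C sketch below is a deliberately reduced model and not the toolchain's algorithm: it assumes each path has a known latency in clock cycles and pads the faster path so both values arrive in the same cycle. The type and function names are hypothetical.

```c
/* Number of delay registers to insert on each of two paths feeding
 * one node, so that both inputs arrive in the same clock cycle. */
struct path_delays {
    unsigned on_a;
    unsigned on_b;
};

/* latency_a and latency_b are the cycle counts of the two paths
 * before balancing; only the faster path receives extra registers. */
static struct path_delays balance_paths(unsigned latency_a, unsigned latency_b)
{
    struct path_delays d = {0u, 0u};
    if (latency_a < latency_b)
        d.on_a = latency_b - latency_a;
    else
        d.on_b = latency_a - latency_b;
    return d;
}
```

In the FIR graph, for instance, an adder fed by a one-cycle load and a three-cycle multiply path would receive two registers on the load side under this model; a full retiming pass applies the same balancing across every node of the scheduled graph.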

[0099] Another function required of the backend is generation of the burst buffer and coprocessor controller schedule.
Once the CHESS configuration has been obtained, it is apparent when it needs to be fed with values from main memory
and when values can be stored back to main memory, and the burst buffer schedule can be established. Accordingly,
a step is provided which involves splitting up the address space of all the data that needs to be loaded into or stored
from the burst buffers memory 5 into fixed bursts of data that the burst buffers controller 7 is able to act upon.
[01 00] For instance, in the FIR example just presented, the input array (srcQ) is split into several bursts of appropriate 
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sizes, such that all the address range needed for the algorithnn is coverpd. This toolchain uses bursts of length B,en 
(where 8,^^ a power of 2, and Is specified as an execution parameter to the toolchain) to cover as much of the input 
address space as possible. When no more can be achieved with this burst length, the toolchain uses bursts of de- 
creasing lengths: B^^^2, B,^„/4, B^^a. 2, 1 until every Input address needed for the algorithm belongs to one and 
only one burst. . . 
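
By way of illustration only, the burst-splitting step described above might be sketched in C as follows. The type burst_t and the function split_into_bursts are hypothetical names introduced for this sketch and are not taken from the toolchain; the sketch also assumes that bursts need not be aligned to their own length, which the description does not state.

```c
#include <assert.h>
#include <stddef.h>

/* One burst: a contiguous run of `length` words starting at `start`. */
typedef struct {
    size_t start;
    size_t length;
} burst_t;

/* Hypothetical helper: cover the address range [base, base + count) with
 * bursts of length b_len first, then b_len/2, b_len/4, ..., 2, 1, so that
 * every address belongs to exactly one burst. Returns the number of bursts
 * written to `out`, or -1 if `out` (capacity `max_out`) is too small. */
int split_into_bursts(size_t base, size_t count, size_t b_len,
                      burst_t *out, int max_out)
{
    /* b_len must be a non-zero power of two, as the text requires. */
    assert(b_len != 0 && (b_len & (b_len - 1)) == 0);

    int n = 0;
    size_t addr = base;
    size_t remaining = count;

    for (size_t len = b_len; len >= 1; len /= 2) {
        /* Use as many bursts of the current length as still fit. */
        while (remaining >= len) {
            if (n == max_out)
                return -1;
            out[n].start = addr;
            out[n].length = len;
            n++;
            addr += len;
            remaining -= len;
        }
    }
    return n;
}
```

For example, a range of 13 words starting at address 100 with B_len = 8 would be covered by three bursts, of lengths 8, 4 and 1.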

[0101] For each one of these bursts, the earliest point in the iteration space in which any of the data loaded is needed
is computed. In other words, to each input burst there is associated one point in the iteration space for which it is
guaranteed that no earlier iterations need any of the data loaded by the burst. It is easy to detect when the execution
of the coprocessor 2 would reach that point in the iteration space. There are thus created:

a loadburst instruction for the relevant addresses, in order to move data into burst buffer memory 5; and

a corresponding synchronisation point (a CC_LX_DECREMENT / BB_LX_INCREMENT pair) to guarantee that
the execution of coprocessor 2 is synchronised with the relevant loadburst instruction.

[0102] To achieve an efficient overlap of computation and communication, the loadburst instruction has to be issued
in advance, in order to hide the latency associated with the transfer of data over the bus.

[0103] All the output address space that has to be covered by the algorithm is partitioned into output bursts, according
to a similar logic. Again, the output space is partitioned into bursts of variable length.
[0104] The toolchain creates:

a storeburst instruction for the relevant addresses; and

a corresponding synchronisation point (a BB_XS_DECREMENT / CC_XS_INCREMENT pair).
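
The behaviour of the two synchronisation pairs can be illustrated by the following C sketch, given as a software model only. The instruction names are those used in this description; the structure sync_counters_t, the function names and the use of a false return value to model a stalled instruction are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Software model of the two synchronisation counters (hypothetical layout). */
typedef struct {
    unsigned lx;  /* bursts loaded but not yet consumed by the coprocessor */
    unsigned xs;  /* results produced but not yet stored to main memory   */
} sync_counters_t;

/* BB_LX_INCREMENT: the burst controller signals that a loadburst is done. */
void bb_lx_increment(sync_counters_t *s)
{
    assert(s != NULL);
    s->lx++;
}

/* CC_LX_DECREMENT: the coprocessor claims loaded data.
 * Returns false (a stall in hardware) if no loaded data is available. */
bool cc_lx_decrement(sync_counters_t *s)
{
    assert(s != NULL);
    if (s->lx == 0)
        return false;
    s->lx--;
    return true;
}

/* CC_XS_INCREMENT: the coprocessor signals that results are in the buffer. */
void cc_xs_increment(sync_counters_t *s)
{
    assert(s != NULL);
    s->xs++;
}

/* BB_XS_DECREMENT: the burst controller claims results for a storeburst.
 * Returns false (a stall in hardware) if no results are available. */
bool bb_xs_decrement(sync_counters_t *s)
{
    assert(s != NULL);
    if (s->xs == 0)
        return false;
    s->xs--;
    return true;
}
```

In this model a decrement that finds its counter at zero does not proceed, which corresponds to the blocked instruction waiting until the matching increment has been executed.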
[0105] At this point, we possess information relevant to:

the relative ordering of loadburst and storeburst instructions, and their parameters of execution (addresses, etc.); and
their position relative to the computation to be performed on coprocessor 2.

[0106] This information is then used to generate appropriate C code to organise the overall computation, as in the
FIR example described above.

[0107] The actual code generation phase (that is, the emission of the C code to run on processor 1) can be accomplished
using the code generation routines contained in the Omega Library of the University of Maryland, available at
http://www.cs.umd.edu/projects/omega/, followed by a customised script that translates the generic output of these 
routines into the form described above. 

Experimental Results - Image Convolution 

[0108] An Image convolution algorithm is described by the following loop nest: 

for (i=0; i<IMAGE_HEIGHT; i++)
  for (j=0; j<IMAGE_WIDTH; j++)
    for (k=0; k<KERNEL_HEIGHT; k++)
      for (l=0; l<KERNEL_WIDTH; l++)
        Dest[i,j] += Source[(i+1)-k, (j+1)-l]*C[k,l];



[0109] Replication has been used to enhance the source image by KERNEL_HEIGHT-1 pixels in the vertical direction
and KERNEL_WIDTH-1 pixels in the horizontal direction in order to simplify boundary conditions. Two kernels are used
in evaluating system performance: a 3x3 kernel and a 5x5 kernel, both performing median filtering.
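
As a minimal sketch of how replication removes the boundary tests from the loop nest, the following self-contained C code enlarges an image by replicating its edge pixels and then applies the weighted-sum loop nest without any boundary checks. The function names, the row-major array layout, the restriction to odd kernel sizes and the index shift (row i + KERNEL_HEIGHT - 1 - k of the enlarged image) are assumptions made for this sketch, and are not the code used in the experiments.

```c
#include <assert.h>

/* Build a (h + kh - 1) x (w + kw - 1) image from an h x w image by
 * replicating edge pixels. Assumes odd kernel sizes, so that the added
 * border splits evenly between the two sides. */
void replicate_pad(const int *src, int h, int w, int kh, int kw, int *pad)
{
    assert(kh % 2 == 1 && kw % 2 == 1);
    int ph = h + kh - 1, pw = w + kw - 1;
    int top = (kh - 1) / 2, left = (kw - 1) / 2;

    for (int r = 0; r < ph; r++) {
        int sr = r - top;              /* source row, clamped to the image */
        if (sr < 0) sr = 0;
        if (sr > h - 1) sr = h - 1;
        for (int c = 0; c < pw; c++) {
            int sc = c - left;         /* source column, clamped likewise */
            if (sc < 0) sc = 0;
            if (sc > w - 1) sc = w - 1;
            pad[r * pw + c] = src[sr * w + sc];
        }
    }
}

/* Four-deep loop nest over the enlarged image: every read is in range,
 * so the inner loops need no boundary conditions. */
void convolve(const int *pad, int h, int w,
              const int *coef, int kh, int kw, int *dest)
{
    int pw = w + kw - 1;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            int acc = 0;
            for (int k = 0; k < kh; k++)
                for (int l = 0; l < kw; l++)
                    acc += pad[(i + kh - 1 - k) * pw + (j + kw - 1 - l)]
                           * coef[k * kw + l];
            dest[i * w + j] = acc;
        }
}
```

With a 3x3 kernel whose only non-zero coefficient is a 1 at its centre, this sketch reproduces the original image, which is a convenient check that the index shift is consistent.
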
[0110] Figures 9 and 10 illustrate the performance of the architecture according to an embodiment of the invention
(indicated as BBC) as against a conventional processor using burst buffers (indicated as BB) and a conventional
processor-and-cache combination (indicated as Cache). Two versions of the algorithm were implemented, one with
32-bit pixels and one with 8-bit pixels. The same experimental measurements were taken for different image sizes,
ranging from 8x8 to 128x128, and for different burst lengths.

[0111] As can be seen from the Figures, the BBC implementation showed a great performance advantage over the
BB and the Cache implementations. The algorithm is relatively complex, and the overall performance of the system in
both BB and Cache implementations is heavily compute-bound - the CPU simply cannot keep up because of the high
complexity of the algorithm. Using embodiments of the invention, in which the computation is vastly more effective as
it is carried out on the CHESS array (with its inherent parallelism), the performance is if anything IO-bound - even
though IO is also efficient through effective use of burst buffers. Multimedia instructions (such as MIPS MDMX) could
improve the performance of the CPU in the BB or the Cache implementations, as they can allow for some parallel
execution of arithmetic instructions. Nonetheless, the resulting performance enhancement is unlikely to reach the
performance levels obtained using a dedicated coprocessor in this arrangement.

Modifications and Variations 


[0112] The function of decoupling the processor 1 from the coprocessor 2 and the burst buffer memory 5 can be
achieved by means other than the instruction queues 6,8. An effective alternative is to replace the two queues with
two small processors (one for each queue) fully dedicated to issuing instructions to the burst buffers memory 5 and
the coprocessor 2, as described in Figure 12. The burst instruction queue is replaced (with reference to the Figure 1
embodiment) by a burst command processor 106, and the coprocessor instruction queue is replaced by a coprocessor 
command processor 108. Since this would be the only task carried out by these two components, there would be no 
need for them to be decoupled from the coprocessor 2 and the burst buffers 7 respectively. Each of the command 
processors 106, 108 could operate by issuing a command to the coprocessor or burst buffers (as appropriate), and 
then do nothing until that command has completed its execution, then issue another command, and so on. This would
complicate the design, but would free the main processor 1 from its remaining trivial task of issuing instructions into 
the queues. The only work to be carried out by processor 1 would then be the initial setting up of these two processors, 
which would be done just before the beginning of the computation. During the computation, the processor 1 would thus 
be completely decoupled from the execution of the coprocessor 2 and the burst buffers memory 5. 
[0113] Two conventional, but smaller, microprocessors (or, alternatively, only one processor running two independent
threads of control) could be used, each one of them running the relevant part of the appropriate code (loop nest). 
Alternatively, two general state machines could be synthesised whose external behaviour would reflect the execution 
of the relevant part of the code (that is, they would provide the same sequence of instructions). The hardware complexity 
and cost of such state machines would be significantly smaller than that of the equivalent dedicated processors. Such 
state machines would be programmed by the main processor 1 in a way similar to that described above. The main 
difference would be that the repetition of events would be encoded as well: this is necessary for processor 1 to be able
to encode the behaviour of one algorithm in a few (if complex) instructions. In order to obtain the repetition of an event
x times, the processor 1 would not have to insert x instructions in a queue, but would have to encode this repetition
parameter in the instruction definition. 

[0114] As indicated above, a particularly effective mechanism is for finite state machines (FSMs) to be used instead
of queues to decouple the execution of the main processor 1 from the execution of coprocessor 2 and the burst buffers
controller 7. This mechanism will now be discussed in further detail.

[0115] In the architecture illustrated in Figure 1, instructions to drive the execution of different I/O streams can be 
mixed with instructions for execution of coprocessor 2. This is possible because the mutual relationships between
system components are known at compile time, and therefore instructions to the different system components can be
interleaved in the source code in the correct order. 

[0116] Two state machines can be built to issue these instructions for execution in much the same way. One such
state machine would control the behaviour of the coprocessor 2, issuing CC_xxx_xxx instructions as required, and the
other would control the behaviour of burst buffers controller 7, issuing BB_xxx_xxx instructions as required.
[0117] Such state machines could be implemented in a number of different ways. One alternative is indicated in
Figure 13. With reference to the vector addition example presented above, this state machine 150 (for the coprocessor
2, though the equivalent machine for the burst buffers controller 7 is directly analogous) implements a sequence of
instructions built from the pattern:

CC_LX_DECREMENT,
CC_LX_DECREMENT,
CC_START_EXEC,
CC_XS_INCREMENT.

[0118] The main state machine 150 is effectively broken up into simpler state machines 151, 152, 153, each of which
controls the execution of one kind of instruction. A period and a phase (note, this has no relationship to periods and
phases which can be associated with I/O streams communicating between the coprocessor 2 and the burst buffers
controller 7) is assigned to each of the simpler state machines. The hardware of state machine 150 will typically contain
a number of such simpler state machines sufficient to satisfy the requirements of intended applications.

[0119] An event counter 154 is defined. The role of the event counter 154 is to allow instructions (in this case for
the coprocessor 2) to be issued in sequence. Each time the event counter 154 is incremented, if there exists a value M
such that M*Period+Phase=EC for one of the simpler state machines 151, 152, 153, that state machine is selected
through comparison logic 155, and its instruction is executed. It is the responsibility
of the application software to ensure that no two distinct state machines can satisfy this equation. When the execution
of that instruction is completed, the event counter 154 is incremented again. This sequence of events can be summarised as:

1: Increment event counter: EC++

2: Choose state machine i for execution if there exists an M such that M*Period_i+Phase_i=EC

3: Execute the instruction described by state machine i (this could include a suspension operation)

4: Go back to 1
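
The event counter mechanism summarised above can be modelled in software as in the following C sketch. It is an illustration under stated assumptions, not a description of the hardware of Figure 13: the type simple_fsm_t, the numeric instruction encoding and the function names are hypothetical, M is taken to be a non-negative integer, and the caller is assumed to supply a trace array with room for one entry per event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical numeric encoding of the coprocessor instructions. */
enum { CC_LX_DECREMENT, CC_START_EXEC, CC_XS_INCREMENT };

/* One simple state machine: issues `instruction` whenever the event
 * counter equals M * period + phase for some integer M >= 0. */
typedef struct {
    unsigned period;   /* must be greater than zero in this sketch */
    unsigned phase;
    int instruction;
} simple_fsm_t;

/* Comparison logic: is there an M >= 0 with M * period + phase == ec? */
static int fsm_matches(const simple_fsm_t *f, unsigned ec)
{
    return ec >= f->phase && (ec - f->phase) % f->period == 0;
}

/* Runs steps 1 to 4 for n_events increments of the event counter and
 * records each issued instruction in `trace`. Returns the number issued. */
size_t run_event_counter(const simple_fsm_t *fsms, size_t n_fsms,
                         unsigned n_events, int *trace)
{
    unsigned ec = 0;
    size_t issued = 0;

    for (unsigned e = 0; e < n_events; e++) {
        ec++;                                           /* step 1 */
        for (size_t i = 0; i < n_fsms; i++) {
            assert(fsms[i].period > 0);
            if (fsm_matches(&fsms[i], ec)) {            /* step 2 */
                trace[issued++] = fsms[i].instruction;  /* step 3 */
                break;  /* software guarantees at most one match */
            }
        }
    }                                                   /* step 4 */
    return issued;
}
```

For instance, the four-instruction pattern given earlier could be produced by four simple state machines of period 4 with phases 1, 2, 3 and 0, two of which issue CC_LX_DECREMENT; this particular assignment is an assumption chosen for the example.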

[0120] A few extra parameters relevant to execution of an instruction (addresses to read from/write to, length of
execution for a CC_START_EXEC, etc.) will have to be encoded in the state machine 150. It should also be noted that
more than one state machine can issue a given instruction, typically with different parameters.
[0121] This system works particularly well to generate periodic behaviour. However, if an event has to happen only
once, it can be encoded in a simple state machine with infinite period and finite phase, the only consequence
being that this simple state machine will be used only the once.
[0122] This approach can itself be varied. For example, to add flexibility to the mechanism, a possible option is to
add 'start time' and 'end time' parameters to the simple state machines, in order to limit the execution of one or more
simple state machines to a predetermined time window.
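
A one-off event and a time window might be expressed as in the following C sketch of the firing condition of a single simple state machine. The sketch makes several assumptions that are not stated in this description: a period field of zero is used to encode an infinite period, the start and end times are measured in values of the event counter and are inclusive, and the names windowed_fsm_t and windowed_fsm_fires are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A simple state machine with an optional time window (hypothetical). */
typedef struct {
    unsigned period;  /* 0 encodes an infinite period in this sketch      */
    unsigned phase;   /* event count of the first (or only) issue         */
    unsigned start;   /* first event count at which the machine may fire  */
    unsigned end;     /* last event count at which the machine may fire   */
} windowed_fsm_t;

/* Returns non-zero if the machine fires when the event counter is `ec`. */
int windowed_fsm_fires(const windowed_fsm_t *f, unsigned ec)
{
    assert(f != NULL);

    /* Outside the window the machine never fires. */
    if (ec < f->start || ec > f->end)
        return 0;

    /* Infinite period: the event happens once, when ec reaches the phase. */
    if (f->period == 0)
        return ec == f->phase;

    /* Periodic case: fire when ec = M * period + phase for some M >= 0. */
    return ec >= f->phase && (ec - f->phase) % f->period == 0;
}
```

Under these assumptions, a machine with period 2, phase 0 and a window from 4 to 8 fires at event counts 4, 6 and 8 only.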

[0123] Programming of these state machines would take place during the initialisation of the system, for example
through the use of memory-mapped registers assigned by the processor 1. An alternative would be the loading of all
the parameters necessary to program these state machines from a predefined region of main memory 3, perhaps
through the use of a dedicated channel and a Direct Memory Access (DMA) mechanism.
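
Purely as an illustration of the memory-mapped register option, the following C sketch shows how processor 1 might program one simple state machine during initialisation. The register layout fsm_regs_t, the enable register and the function program_simple_fsm are all hypothetical; no register map is defined in this description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block for one simple state machine. On real
 * hardware a pointer to this structure would be set to the address at
 * which the registers are mapped. */
typedef struct {
    volatile uint32_t period;
    volatile uint32_t phase;
    volatile uint32_t opcode;   /* instruction the machine issues */
    volatile uint32_t enable;   /* 0 = idle, 1 = running */
} fsm_regs_t;

/* Writes the parameters of one simple state machine. The machine is
 * held idle while it is reprogrammed and enabled only at the end. */
void program_simple_fsm(fsm_regs_t *regs, uint32_t period,
                        uint32_t phase, uint32_t opcode)
{
    assert(regs != NULL);
    regs->enable = 0;
    regs->period = period;
    regs->phase = phase;
    regs->opcode = opcode;
    regs->enable = 1;
}
```

Disabling the machine before its parameters are changed is a design choice of the sketch, intended to avoid the machine firing with a partly updated period and phase.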

[0124] The other alternative mechanism suggested, of using two dedicated microprocessors, would require no significant
modification to the programming model for the architecture of Figure 1: the same techniques used to program
main processor 1 could be used, with an additional step of splitting commands intended for the coprocessor 2 from
those intended for burst buffers controller 7. Although feasible, this arrangement may be disadvantageous with respect
to the state machine approach. It would be necessary for these processors to be provided with access to main memory
3 or other DRAM, adding to the complexity of the system. The cost and complexity of the system would also be increased
by adding two microprocessors in this way (notwithstanding that they are only present to perform very simple computations).

[0125] Further developments beyond the architecture of Figure 1 and its alternatives can also be made without
departing from the essential principles of the invention. Three such areas of development will be described below:
pipelines, data dependent conditionals/unknown execution times, and non-affine accesses to memory.
[0126] Pipeline architectures have value where applications require more than one transformation to be carried out
on their input data streams: for instance, a convolution may be followed immediately by a correlation. In order to
accommodate this kind of arrangement, changes to both the architecture and the computational model will be required.
Architecturally, successive buffered CHESS arrays could be provided, or a larger partitioned CHESS array, or a CHESS
array reconfigured between computational stages. Figures 11A and 11B show different pipeline architectures effective
to handle such applications and involving plural CHESS arrays. Figure 11A shows an arrangement with a staggered
CHESS/burst buffer pipeline instructed from a processor 143 and exchanging data with a main memory 144, where a
CHESS array 141 receives data from a first set of burst buffers 142 and passes it to a second set of burst buffers 145,
this second set of burst buffers 145 interacting with a further CHESS array 146 (potentially this pipeline could be
continued with further sets of CHESS arrays and burst buffers). Synchronisation becomes more complex, and involves
communication between adjacent CHESS arrays and between adjacent sets of burst buffers, but the same general
principles can be followed to allow efficient use of burst buffers, and efficient synchronisation between CHESS arrays:
semaphores could be used to guarantee the correctness of the computation carried out by successive stages of the
pipeline.

[0127] Figure 11B shows a different type of computational pipeline, with an SRAM cache 155 between two CHESS
arrays 151, 156, with loads provided to a first set of burst buffers 152 and stores provided by a second set of burst
buffers 157. The role of the processor 153 and of the main memory 154 is essentially unchanged from other embodiments.
Synchronisation may be less difficult in this arrangement, although the arrangement may also exploit parallelism
less effectively.
[0128] One constraint on efficient use of the coprocessor in an architecture as described above is that the execution
time of the coprocessor implementation should be known (to allow efficient scheduling). This is achievable for many
media-processing loops. However, if execution times are unknown at compile time, then the scheduling requirements
in the toolchain need to be relaxed, and appropriate allowances need to be made in the synchronisation and communication
protocols between the processor, the coprocessor and the burst buffers. The coprocessor controller also will
need specific configuration for this circumstance.

[0129] Another extension is to allow non-affine references to burst buffers memory. In the burst buffers model used
above, all access is of the type AI+F, where A is a constant matrix, I is the iteration vector and F is a constant vector.
Use of this limited access model allows the coprocessor controller and the processor to know in advance what data
will be needed at any given moment in time, allowing efficient creation of logical streams. The significance of this to
the architecture as a whole is such that it is unclear how non-affine access could be provided in a completely arbitrary
way (the synchronisation mechanisms would appear to break down), but it would be possible to use non-affine array
accesses to reference lookup tables. This could be done by loading lookup tables into burst buffers, and then allowing
the coprocessor to generate a burst buffer address relative to the start of the lookup table for subsequent access. It
would be necessary to ensure that such addresses could be generated sufficiently far in advance of the time that they
will be used (possibly this could be achieved by a refinement to the synchronisation mechanism) and to modify the
logical stream mechanism to support this type of recursive reference.
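
The affine access model AI+F can be made concrete with a short C sketch that evaluates the index vector for a given iteration vector. The function affine_access and its row-major storage of the matrix A are assumptions of the sketch rather than part of the architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates out = A * iter + f, where A is a rows x cols constant matrix
 * stored row by row, iter is the iteration vector (cols entries) and f is
 * a constant offset vector (rows entries). */
void affine_access(const long *a, const long *f, const long *iter,
                   size_t rows, size_t cols, long *out)
{
    assert(a != NULL && f != NULL && iter != NULL && out != NULL);

    for (size_t r = 0; r < rows; r++) {
        long acc = f[r];                      /* constant term F */
        for (size_t c = 0; c < cols; c++)
            acc += a[r * cols + c] * iter[c]; /* row r of A times I */
        out[r] = acc;
    }
}
```

As an example, a two-dimensional reference of the form [(i+1)-k, (j+1)-l] with iteration vector I = (i, j, k, l) corresponds to A = [[1, 0, -1, 0], [0, 1, 0, -1]] and F = (1, 1). Because A and F are constants, every address can be computed ahead of time from the iteration vector alone, which is the property the logical streams rely on.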

[0130] Many variations and extensions to the architecture of Figure 1 can thus be carried out without deviating from
the invention as claimed.


Claims 

1. A computer system, comprising:

a first processor;

a second processor for use as a coprocessor to the first processor;

a memory;

at least one data buffer for buffering data to be written to or read from the memory in data bursts in accordance
with burst instructions;

a burst controller for executing the burst instructions; and

a burst instructions element for providing burst instructions in a sequence for execution by the burst controller;

whereby burst instructions are provided by the first processor to the burst instructions element, and data is read
from and written to the memory by the second processor through the at least one data buffer in accordance with
burst instructions executed by the burst controller.

2. A computer system as claimed in claim 1, further comprising a synchronisation mechanism for synchronising
execution of the coprocessor and of burst instructions with availability of data on which the coprocessor and said
burst instructions are to execute.

3. A computer system as claimed in claim 1, further comprising a coprocessor instructions element for providing
coprocessor instructions to control execution of the second processor in a sequence, wherein said coprocessor
instructions are provided by the first processor.

4. A computer system as claimed in claim 3, further comprising a coprocessor controller, wherein the coprocessor
controller receives coprocessor instructions from the coprocessor instructions element and controls execution of
the second processor in accordance with received coprocessor instructions.

5. A computer system as claimed in claim 4, wherein the coprocessor controller controls communication between
the coprocessor and the at least one data buffer.

6. A computer system as claimed in claim 5, wherein a bus exists between the coprocessor controller and the at least
one data buffer, and wherein the coprocessor controller controls access of separate data streams in and out of
the second processor to the bus.



7. A computer system as claimed in any of claims 3 to 6, further comprising a synchronisation mechanism for
synchronising execution of coprocessor instructions and burst instructions with availability of data on which said
coprocessor instructions and burst instructions are to execute.

8. A computer system as claimed in claim 7, wherein the synchronisation mechanism is adapted to block execution
of coprocessor instructions requiring execution of the second processor on data which has not yet been loaded to
the at least one data buffer, and is adapted to block execution of burst instructions for storage of data from the
at least one data buffer to the memory where such data has not been provided to the at least one data buffer
by the second processor.

9. A computer system as claimed in claim 7 or claim 8, wherein the synchronisation mechanism comprises at least
two counters adapted to be incremented and decremented by specific executed coprocessor instructions and
specific executed burst instructions, wherein if a counter cannot be further decremented at least one type of
instruction is blocked from execution.

10. A computer system as claimed in claim 9, wherein a first counter is incrementable by execution of a specific burst
instruction and decrementable by execution of a specific coprocessor instruction, and when the first counter cannot
be further decremented coprocessor instructions for associated execution of the second processor are stalled or
prevented.

11. A computer system as claimed in claim 9 or claim 10, wherein a second counter is incrementable by execution of
a specific coprocessor instruction and decrementable by execution of a specific burst instruction, and when the
second counter cannot be further decremented burst instructions for associated storage of data from the at least
one data buffer to the memory are stalled or prevented.

12. A computer system as claimed in any preceding claim depending on claim 1 or claim 3, where the or each
instructions element is an instruction queue.

13. A computer system as claimed in any preceding claim depending on claim 1 or claim 3, where the or each
instructions element is a further processor.

14. A computer system as claimed in any preceding claim depending on claim 1 or claim 3, where the or each
instructions element is a programmable state machine.

15. A computer system as claimed in any preceding claim, wherein the first processor is the central processing unit 
of a computer device. 



16. A method of operating a computer system, comprising: 

providing code for execution by a first processor; 

extraction from the code of a task to be carried out by a second processor acting as coprocessor to the first 
processor; 

determining from the code and the task burst instructions to allow data to be read from and written to a main 
memory in data bursts for access by the second processor by means of at least one data buffer; and 

execution of the task on the coprocessor together with execution of burst instructions by a burst controller 
controlling transfer of data between the at least one data buffer and the main memory. 

17. A method as claimed in claim 16, wherein following extraction of the task from the code, coprocessor instructions 
for execution by a coprocessor controller are determined to control execution of the task by the second processor. 

18. A method as claimed in claim 17, wherein in execution of the task, synchronisation between execution of
coprocessor instructions and execution of burst instructions is achieved by a synchronisation mechanism.

19. A method as claimed in claim 18, wherein said synchronisation mechanism comprises blocking of first instructions
until second instructions whose completion is necessary for correct execution of the first instructions have
completed.

20. A method as claimed in claim 19, wherein counters incrementable or decrementable by specific coprocessor
instructions and burst instructions are used to provide said synchronisation mechanism.